R for Spatial Analysis

Author

Julio Novoa

Introduction to R for spatial analysis

Generated by AI

Spatial data types

The 2 basic spatial data types you will encounter are vector and raster data. Vectors are represented by point, lines, polygons, and raster data is represented by a matrix of elements.

Spatial data types

Coordinate reference systems

Geodetic Projected
Longitude/Latitude coordinates UTM coordinates (meters, feet)
NAD83, WGS84, … UTM, BC Albers, …

UTM projection

The following example shows Vancouver Island using 2 different CRS at the same screen scale. The distortions in size and shape are evident.

NAD83 vs BC Albers

Storage data formats

There are several storage formats for geospatial data, from legacy ones such as Shapefile or GeoTIFF to cloud-optimized ones such as GeoParquet, FlatGeobuf, COG, or Zarr.

In this tutorial we will use a GeoPackage file, which is a modern geospatial format that stores vector and raster data layers, as well as symbology and projects from QGIS. It is strongly recommended to switch from Shapefile to a better format such as GeoPackage due to several limitations of Shapefiles.

R Demo

Data management

In this section we will use the sf and mapview packages for data management and visualization. The sf package is used to access and perform spatial operations on geospatial data sets, while mapview allows us to do quick interactive visualizations.

Let’s set the working directory where all the data sets we need are stored, and define some variables names. This folder also contains a CSV file with points longitude/latitude coordinates.

Let’s load the R packages:

library(tidyverse)
library(sf)
library(mapview)
# =============================
# set working folder
setwd("~/Desktop/LGL_Introduction_R_Spatial")
# =============================

# gis data sets
gpkg_file = "gis_datasets.gpkg"
csv_file = "aerodromes.csv"

The next step is exploring the layers contained in the GeoPackage file. Here we will verify - for each layer - its name, its geometry, number of features (i.e. rows), number of fields (i.e. columns), and its CRS (coordinate reference system).

More information about CRS: Spatial Reference

st_layers(gpkg_file)
Driver: GPKG 
Available layers:
        layer_name     geometry_type features fields          crs_name
1 vancouver_island     Multi Polygon        1      8 NAD83 / BC Albers
2           places             Point      281     16 NAD83 / BC Albers
3     watercourses Multi Line String      674     16 NAD83 / BC Albers
4          samples             Point     2000      1     Undefined SRS

We will use the function st_read() to import each GeoPackage layer into an R variable. Each variable is a data frame, and they contain a new sticky column that stores the geometry of each feature.

# read GeoPackage layers
island = st_read(dsn = gpkg_file, layer = "vancouver_island")
places = st_read(dsn = gpkg_file, layer = "places")
watercourses = st_read(dsn = gpkg_file, layer = "watercourses")
samples = st_read(dsn = gpkg_file, layer = "samples")

Let’s verify the class of one of these layers.

class(places)
[1] "sf"         "data.frame"

They are of types sf and data.frame, so all tidyverse operations made on data frames can be executed on these layers, as well as several spatial operation using the sf package.

Let’s have a glimpse of the contents of the layer called places:

glimpse(places)
Rows: 281
Columns: 17
$ OBJECTID                  <dbl> 241, 242, 249, 250, 251, 252, 253, 254, 255,…
$ feature_id                <chr> "6374637b-72c6-4061-942b-62491e7c1701", "9e1…
$ jurisdiction              <int> 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, …
$ population_year           <chr> "2006", "2006", "2006", "2006", "2006", "200…
$ population_source         <int> 876, 878, 878, 878, 877, 877, 878, 877, 878,…
$ population_estimate       <int> 174461, 3000, 300, 200, 100, 497, 50, 198, 1…
$ capital                   <int> 884, 884, 884, 884, 884, 884, 884, 884, 884,…
$ place_name_descriptor     <int> 899, 899, 899, 899, 899, 899, 899, 899, 899,…
$ country                   <int> 140, 140, 140, 140, 140, 140, 140, 140, 140,…
$ place_name_db             <chr> "CGNDB", "CGNDB", "CGNDB", "CGNDB", "CGNDB",…
$ place_name_id             <chr> "bc5903fad05411d892e2080020a0f4c9", "1a22a40…
$ place_name_en             <chr> "Richmond", "Caulfeild", "Snug Cove", "Secre…
$ place_name_fr             <chr> "Richmond", "Caulfeild", "Snug Cove", "Secre…
$ place_name_other          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ place_name_other_language <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ map_selection             <int> 71, 71, 71, 71, 71, 71, 71, 71, 71, 74, 74, …
$ geom                      <POINT [m]> POINT (1207465 463889.1), POINT (11991…

If we check the contents of any sf layer, we will see a header with information about the number of features (i.e. number of rows), the geometry type of these features (i.e points, lines, polygons), the dimensions of the data (e.g. XY, XYZ), the bounding box (i.e. minimum/maximum coordinates) and the CRS (i.e. geodetic, projected). After this header we can see the data frame contents, including the geometry column called geom in this example.

places |>
  select(place_name_en)
Simple feature collection with 281 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 855892.1 ymin: 372000.2 xmax: 1213753 ymax: 657768.6
Projected CRS: NAD83 / BC Albers
First 10 features:
     place_name_en                      geom
1         Richmond  POINT (1207465 463889.1)
2        Caulfeild  POINT (1199194 484443.5)
3        Snug Cove  POINT (1193555 488820.8)
4      Secret Cove  POINT (1148312 504365.3)
5         Brew Bay    POINT (1117348 530199)
6     Myrtle Point  POINT (1109404 532847.3)
7      Saltery Bay  POINT (1130854 531363.3)
8  Mansons Landing  POINT (1073064 560778.1)
9        Whaletown  POINT (1068188 565499.5)
10      Port Hardy POINT (893691.2 634913.8)

Now, let’s create some maps using different plotting tools:

# using base R
plot(island$geom)
plot(places$geom, add = TRUE)

Using ggplot2, the following set of geometries and coordinates can be used to display sf objects:

  • geom_sf()

  • geom_sf_text()

  • geom_sf_label()

  • coords_sf()

# using ggplot
ggplot() +
  geom_sf(data = island) +
  geom_sf(data = places) +
  geom_sf_text(data=island, aes(label = "Van Isle"))

# using mapview
mapview(
  list(island, places)
)
mapview(
    select(places, place_name_en, population_estimate),
    layer.name = "Populated Places",
    col.regions = "red",
    label = "place_name_en",
    popup = TRUE
  ) +
  mapview(
    island,
    layer.name = "Vancouver Island",
    col.regions = "gray",
    alpha.regions = 0,
    label = NA,
    popup = FALSE
    )

It is always recommended to ensure that all the sf layers have the same CRS. Let’s verify again the GeoPackage layers:

# print GeoPackage layers
st_layers(gpkg_file)
Driver: GPKG 
Available layers:
        layer_name     geometry_type features fields          crs_name
1 vancouver_island     Multi Polygon        1      8 NAD83 / BC Albers
2           places             Point      281     16 NAD83 / BC Albers
3     watercourses Multi Line String      674     16 NAD83 / BC Albers
4          samples             Point     2000      1     Undefined SRS

The layer samples doesn’t have a CRS. For this exercise, we will assume all layers have the same CRS (i.e. BC Albers, EPSG=3005), so let’s assign that CRS to samples using the st_crs() function:

# assign CRS
st_crs(samples) = 3005
Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
that
samples
Simple feature collection with 2000 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 830015.5 ymin: 369787.2 xmax: 1196735 ymax: 650647.3
Projected CRS: NAD83 / BC Albers
First 10 features:
   elevation                      geom
1  271.85156 POINT (867972.5 647030.2)
2   55.33203  POINT (1169922 415364.5)
3  934.28906  POINT (1011965 504777.3)
4  602.26562  POINT (1162470 395771.9)
5  281.36328  POINT (1023983 484586.4)
6  248.04297  POINT (1183753 395494.5)
7  330.00000  POINT (1040468 569138.6)
8  696.16797 POINT (991193.7 568569.5)
9  902.84375 POINT (947531.7 600545.6)
10 380.98438  POINT (1123901 422778.5)

Now, we will read a CSV file that has points coordinates expressed as longitude and latitude. The sf points object will be created using the function st_as_sf(). Next, we will reproject these points from NAD83 (4269) to BC Albers (3005) using the st_transform() function, and finally we will extract the coordinates of these points in the BC Albers projection.

# read coordinates from a CSV file
csv = read_csv(csv_file)
Rows: 48 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): interntl_civil_aviation_org_id, community_location, aerodrome_name
dbl (4): aerodrome_status, elevation, longitude, latitude

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# create Lon/Lat (NAD83) points layer
aerodromes = st_as_sf(
  csv,
  coords = c('longitude', 'latitude'),
  crs = st_crs(4269),
  remove = FALSE
  )

# reproject NAD83 points to BC Albers and extract coordinates
aerodromes_3005 = aerodromes %>% 
  st_transform(crs = 3005) %>% 
  mutate(bcalbers_x = st_coordinates(.)[, 1]) %>% 
  mutate(bcalbers_y = st_coordinates(.)[, 2])

# results
aerodromes_3005 |>
  as.data.frame() |>
  select(longitude, latitude, bcalbers_x, bcalbers_y) |>
  head(5)
  longitude latitude bcalbers_x bcalbers_y
1 -125.2339 49.96438  1054967.2   550119.8
2 -124.8880 49.71063  1080224.3   522234.4
3 -124.9845 49.67923  1073314.4   518638.4
4 -124.9375 49.32202  1077306.2   478989.1
5 -127.3667 50.68056   903483.8   630446.6

The next code block will show how to access ArcGIS Online data sets using the package arcgislayers. We will read neighborhood polygons and hedges locations from the Open Data Portal of the City of Victoria.

library(arcgislayers)

# read directly from ArcGIS Online

# Victoria neighborhoods
url_nhoods = "https://maps.victoria.ca/server/rest/services/OpenData/OpenData_Land/MapServer/4/query?outFields=*&where=1%3D1"
vic_nhoods = arc_open(url_nhoods) |> arc_select()

# Victoria business licences
url_business = "https://maps.victoria.ca/server/rest/services/OpenData/OpenData_PermitsAndLicences/MapServer/1/query?outFields=*&where=1%3D1"
vic_business = arc_open(url_business) |> arc_select()

mapview(
  list(vic_business, vic_nhoods),
  col.regions = list("green", "orange"),
  label = list(NA, "Neighbourhood"),
  legend = list(FALSE, FALSE),
  popup = list(FALSE, FALSE)
)

The arcgislayers package does work well with Quarto. Here are the same data sets, imported from a CSV and a GeoJSON file.

# Victoria business licences (CSV)
vic_business = st_read("business_licences.csv") |>
  st_as_sf(
    coords = c('X', 'Y'),
    crs = st_crs(26910),
    remove = FALSE
    )

# Victoria neighborgoods (GeoJSON) 
vic_nhoods = st_read("neighbourhood_boundaries.geojson") |>
  st_transform(crs = st_crs(vic_business))

Let’s check the geometry field in the neighborhoods layer. Note the geometry column for this layer is called geometry while in those layers coming from GeoPackage is called geom. These differences can cause problems when binding data frames.

attr(vic_nhoods, "sf_column")
[1] "geometry"

Let’s plot those layers:

mapview(
  list(vic_business, vic_nhoods),
  col.regions = list("green", "orange"),
  label = list(NA, "Neighbourhood"),
  legend = list(FALSE, FALSE),
  popup = list(FALSE, FALSE)
)

Spatial Operations

The most common spatial operations that can be performed on sf objects are:

  • spatial joining

  • aggregating

  • summarizing

The following example shows how to find the nearest aerodrome and calculate the distance:

# find the nearest aerodrome indices
nearest = st_nearest_feature(places, aerodromes_3005)
nearest
  [1]  8 35 35 22  2  2  2 13 12 30 32 25 38  2  7 16  6 37 23 25  2 42 42 46 44
 [26]  9  9 39 35 22 14  7 17 46 46 11 42  5 27 16 37 42  1 38 10 33 19 23 39  6
 [51] 19 39  7  7 19 44  9 19 39 19 19 19 42  8 22  7  7  2 18  1 12  4 23 44  9
 [76] 42 34 34 34 17 18 14 13 12  8 12  1 19 34 45 39  9 42 34  7 35 35 35 22  7
[101] 12 41 16 20 28  6 21 25 25 37  9 19 42 19 25 25 21 39 39 39  2  2 22  2 35
[126] 35 35  2  6  6 46 37  8  8  7  7  1 12 41  6  6 46 37 23  8 19 10 40 21 23
[151]  7  7 17 48 18 14  6  8 30  6 28 46 37 33 46 27 39 42 19 42 39  8 36 27  1
[176] 19 42 19 22  7  7  4  4 12  6 35 22  7  2  2  2 14 14 30 19 38 10  8  8  9
[201] 42  8 27 12  9  8  9  8  7 17  3 17  1 15 13 41  6  6 10 42 19  8 10  8 24
[226] 25 38  2 19 35  7  3 27 13 22 35  2 27 18 15 13 14 22 17 14 12 37 25 10  7
[251] 22  7  2  2  7  2  2  5  6 37 23 10  8 10 39 42 43  6 23 33 19 35 35 35  7
[276] 19  8  8 36 12 12
# join the closest aerodrome name and calculate distance
places = places |>
  mutate(
    closest_aero = st_join(places, aerodromes_3005, st_nearest_feature)[["aerodrome_name"]],
    closest_dist = st_distance(places, aerodromes_3005[nearest, ], by_element = TRUE)
  )
places |>
  as.data.frame() |>
  select(place_name_en, closest_aero, closest_dist) |>
  head(5)
  place_name_en              closest_aero closest_dist
1      Richmond           Nanaimo Airport 53344.86 [m]
2     Caulfeild   Nanaimo Water Aerodrome 52882.87 [m]
3     Snug Cove   Nanaimo Water Aerodrome 49548.63 [m]
4   Secret Cove Long Lake Water Aerodrome 35861.28 [m]
5      Brew Bay             Comox Airport 37968.51 [m]

In this next example we find populated places that are located within a 1-km buffer from watercourses:

# select places within a 1-km buffer around rivers
flooded_places = places[lengths(st_intersects(places, st_buffer(watercourses, 1000)))>0,]

flooded_places |>
  as.data.frame() |>
  select(place_name_en) |>
  head(5)
   place_name_en
24     Ehatis 11
31    Quinsam 12
32      Qualicum
39    Tsahaheh 1
41   Vernon Camp
# visualization of flooded places
mapview(places, col.region = "green", cex = 3) +
  mapview(flooded_places, col.region = "black", alpha.regions = 1) +
  mapview(watercourses, color = "blue") +
  mapview(st_buffer(watercourses, 1000), col.region = "pink")

Dissolving internal boundaries is a common spatial operation. Here we use tidyverse methods and the implicit function st_union() to get a layer of the boundary of the city:

vic_boundary = vic_nhoods |> summarise()
mapview(vic_boundary)

In the following example we count the number of business licences located in the North Park neighborhood:

# subset neighbourhoods: North Park
north_park = vic_nhoods |>
  filter(Neighbourhood == "North Park")

# select all hedges inside North Park
north_park_licences = vic_business[lengths(st_intersects(vic_business, north_park)) > 0, ]
mapview(
  north_park_licences,
  layer.name = paste("licences:", nrow(north_park_licences)),
  col.regions = "red") + north_park

Now, we count the number of business licences in all the neighborhoods:

# count number of business licences in each neighborhood
vic_nhoods$n_licences = lengths(st_intersects(vic_nhoods, vic_business))

mapview(
  filter(vic_nhoods, n_licences > 0),
  zcol = "n_licences",
  layer.name = "licences")

Data aggregation data can also be done by using Uber’s H3 global indexing system. Note that all sf objects - for the h3 package functions - must use CRS = 4326 (WGS84).

library(h3jsr)
# get h3 indices of hexagons located inside the neighborhoods
h3_idxs = polygon_to_cells(
  geometry = st_transform(vic_nhoods, 4326),
  res = 10
)

# create hexagons
h3_nhoods = cell_to_polygon(
  input = unlist(h3_idxs),
  simple = FALSE
)

mapview(
  list(vic_nhoods, h3_nhoods),
  color = list("purple", "black"),
  alpha.regions = list(0.5, 0),
  legend = list(FALSE, FALSE)
)

Now, let’s count the number of licences inside each hexagon:

# count number of licences per hexagon
h3_nhoods$n_licences = lengths(st_intersects(h3_nhoods, st_transform(vic_business, 4326)))

mapview(
  filter(h3_nhoods, n_licences > 0),
  zcol = "n_licences",
  layer.name = "licences")

Visualization

All the code blocks in this section must be copied and pasted into an R script.

First, we will access geospatial data sets hosted on Overture Maps cloud. The column-oriented data format is called GeoParquet mostly designed for distributed systems.

library(arrow)

# connect to Overture Maps buildings
buildings = open_dataset('s3://overturemaps-us-west-2/release/2024-09-18.0/theme=buildings?region=us-west-2')
# nrow(buildings). # 2+ billion buildings

Let’s download all the buildings in Victoria:

# extract bbox
vic_bbox = st_bbox(st_transform(vic_nhoods, 4326))

# read into memory all buildings in Victoria
vic_buildings = buildings |>
  filter(bbox$xmin > vic_bbox[1],
         bbox$ymin > vic_bbox[2],
         bbox$xmax < vic_bbox[3],
         bbox$ymax < vic_bbox[4]) |>
  select(id, geometry, height) |>
  collect() |>
  st_as_sf(crs = 4326) |>
  mutate(height = ifelse(is.na(height), 8, height))

# mapview
mapview(vic_buildings, layer.name = nrow(vic_buildings))

Using rdeck, 2D and 3D maps can be created. A Mapbox token is required. The easiest configuration is to save the token in the user’s .Renviron file.

# remotes::install_github("qfes/rdeck@*release")
library(rdeck)

rdeck(map_style = mapbox_dark(),
      initial_view_state = view_state(
        center = c(-123.3479, 48.45627),
        zoom = 11,
        bearing = 15,
        pitch = 85
      )) |>
  add_polygon_layer(
    data = vic_buildings,
    name = "Victoria, BC",
    get_polygon = geometry,
    get_elevation = height,
    get_fill_color = scale_color_linear(
      col = height,
      palette = viridisLite::inferno(100, direction = -1)
    ),
    extruded = TRUE,
    opacity = 0.2
    )

There are numerous mapping libraries for spatial data. Here is an example using mapgl. Some functions from this package also requires a Mapbox token.

library(mapgl)

maplibre() |>
 add_navigation_control() |>
 add_fullscreen_control() |>
 add_scale_control() |>
 fit_bounds(vic_buildings) |>
 add_fill_extrusion_layer(
   id = "Buildings",
   source = vic_buildings,
   fill_extrusion_color = "orange",
   fill_extrusion_height = list(
     "interpolate",
     list("linear"),
     list("zoom"),
     10,
     0,
     16,
     list("get", "height")
   )
 ) |>
 add_layers_control(collapsible = TRUE)

mapgl can also create heatmaps:

maplibre(bounds = h3_nhoods) |>
  add_heatmap_layer(
    id = "licences",
    source = vic_business,
    heatmap_color = interpolate(
      property = "heatmap-density",
      values = seq(0, 1, 0.2),
      stops = c('rgba(33,102,172,0)', 'rgb(103,169,207)',
                'rgb(209,229,240)', 'rgb(253,219,199)',
                'rgb(239,138,98)', 'rgb(178,24,43)')
    ),
    heatmap_opacity = 0.7
)

Suggested Reading